Effects: partial CPS transform #1384
Conversation
It would be nice to be able to propagate information across units during separate compilation. (related to #550)

Would global_flow.ml allow addressing #594?
I have added a test.
@pmwhite Do you want to give it a try?

I didn't read it in detail, but it looks good from here. The performance wiki page still needs to be updated (text and graphs). It would be nice to show timing improvements for non-benchmark programs.
Indeed, I would. Likely some time next week.
Thanks for this work. The improvements are great. IIUC the benchmarks don't use effect handlers. I would also be interested in seeing the improvements in programs that use effect handlers. In the PLDI 21 paper on effect handlers, we studied the performance of effect handlers using two small benchmarks: chameneos redux (Section 6.3.1) and generators (Section 6.3.2). The source code for the benchmarks is here: https://github.com/kayceesrk/code-snippets/tree/master/eff_bench. I would be interested in seeing the performance difference between
This makes a significant difference as well. Here is a quick measurement:
Thanks for the results. It is good to see partial CPS doing well here. The next question is harder to answer, because it may be ill informed, but let me ask it anyway. On these benchmarks, how close to perfectly precise / optimal performance is the current partial CPS? As in, if you had a chance to only CPS those functions which are absolutely needed in these benchmarks, what would the performance be?
I think the code for

```ocaml
let chams = List.map ~f:(fun c -> ref c) colors in
...
let ns = List.map ~f:MVar.take fs in
```
Thanks @vouillon for your answer. It helped me put the numbers in perspective.
Do you expect the analysis to be more expensive when effects are off?
Just tried this patch out. I've run into an issue, I believe with the lexer, which is having trouble parsing column 57 of this line. I assume this has to do with the recent changes to the lexer/parser, and not with this particular PR, but it does block me from testing this PR itself.
Should be fixed by #1395
I've now run into the following error: This refers, I believe, to this library; hopefully that reproduces easily enough.
@vouillon, should we merge?
I still need to update the documentation...
We start from a pretty good ordering (reverse postorder is optimal when there is no loop). Then we use a queue so that we process all other nodes before coming back to a node, resulting in fewer iterations. This is useful when the graph changes dynamically.
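The iteration strategy above can be sketched as a small worklist solver (hypothetical names, not the actual global_flow.ml API): seed a FIFO queue in reverse postorder, then re-enqueue a node's successors only when its value changes, so a node is revisited only after all other pending nodes have been processed.

```ocaml
(* Sketch: fixpoint over an integer-indexed graph. [order] is the
   initial visit order (ideally reverse postorder), [succs] gives the
   successors of a node, [update] recomputes a node's value from the
   current value array. A node is enqueued at most once at a time. *)
let fixpoint ~order ~succs ~update init =
  let value = Array.copy init in
  let in_queue = Array.make (Array.length init) false in
  let q = Queue.create () in
  List.iter (fun i -> Queue.add i q; in_queue.(i) <- true) order;
  while not (Queue.is_empty q) do
    let i = Queue.pop q in
    in_queue.(i) <- false;
    let v = update i value in
    if v <> value.(i) then begin
      value.(i) <- v;
      (* The value changed: the successors must be reconsidered. *)
      List.iter
        (fun j ->
          if not in_queue.(j) then begin
            Queue.add j q;
            in_queue.(j) <- true
          end)
        (succs i)
    end
  done;
  value
```

For instance, reachability from node 0 in a four-node graph with a back edge converges in a single extra pass with this ordering.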
We omit stack checks when jumping from one block to another within a function, except for backward edges. Stack checks are also omitted when calling the function continuations. We have to check the stack depth in `caml_alloc_stack` for the test `evenodd.ml` to succeed. Otherwise, popping all the fibers exhausts the JavaScript stack. We don't have this issue with the OCaml runtime since it allocates one stack per fiber.
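The bounded-stack idea can be illustrated with a small trampoline (a hedged sketch in OCaml with illustrative names; the actual runtime operates directly on the JavaScript stack). Running in direct style most of the time and bouncing back to the trampoline only when a depth counter exceeds a bound is analogous to checking the stack only on backward edges rather than on every block-to-block jump.

```ocaml
(* A computation either finishes or yields a suspended continuation
   that the trampoline resumes from a fresh (reset) stack depth. *)
type 'a step = Done of 'a | More of (unit -> 'a step)

let rec trampoline = function
  | Done v -> v
  | More k -> trampoline (k ())

(* Only check the depth counter occasionally; intermediate steps run
   with no check at all. (In OCaml [go] is tail-recursive anyway;
   this merely models the JavaScript situation.) *)
let bound = 1000

let countdown n =
  let rec go n depth =
    if n = 0 then Done 0
    else if depth >= bound then More (fun () -> go (n - 1) 0)
    else go (n - 1) (depth + 1)
  in
  trampoline (go n 0)
```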
I think the issue only occurs when optimization of tail recursion is enabled
We analyze the call graph to avoid turning functions into CPS when we know that they don't involve effects. This relies on a global control flow analysis to find which functions might be called where.
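The core of this idea can be sketched as a backward propagation over the call graph (hypothetical names; the analysis in this PR is flow-based and considerably more precise): a function must be CPS-transformed if it performs an effect itself, or if it calls, directly or indirectly, a function that must be.

```ocaml
(* Sketch: [performs_effect] says whether a function performs an
   effect directly; [calls] gives its direct callees. We mark
   effectful functions, then propagate the mark to callers until a
   fixpoint is reached. Returns the membership predicate. *)
let needs_cps ~performs_effect ~calls functions =
  (* Reverse edges: for each callee, record its callers. *)
  let callers = Hashtbl.create 16 in
  List.iter
    (fun f -> List.iter (fun g -> Hashtbl.add callers g f) (calls f))
    functions;
  let cps = Hashtbl.create 16 in
  let q = Queue.create () in
  let mark f =
    if not (Hashtbl.mem cps f) then begin
      Hashtbl.replace cps f ();
      Queue.add f q
    end
  in
  List.iter (fun f -> if performs_effect f then mark f) functions;
  while not (Queue.is_empty q) do
    let f = Queue.pop q in
    List.iter mark (Hashtbl.find_all callers f)
  done;
  fun f -> Hashtbl.mem cps f
```

A function never reaching an effectful call stays in direct style, which is exactly what makes the transform partial.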
Hi, I'm surprised to see in this benchmark that partial CPS is in some cases much slower? (The median is 0.65 faster than full CPS, but in some cases it is > 4 times slower, meaning 25/30 times slower than no CPS.) Here are the tests for which partial is slower:
The slower tests seem to be at the end of the table. If the order corresponds to the order of execution, maybe something happened to the machine while running the tests... Another explanation could be that some control flow is exception-based and the full CPS version would
That was my hypothesis as well. We should try to reproduce this at some point. Unfortunately, compiling these benchmarks is a bit complicated when you are not at Jane Street...

We identify functions that don't involve effects by analyzing the call graph, and we keep them in direct style. This relies on a global control flow analysis to find which functions might be called where.
The analysis is very effective on small / monomorphic programs.
`hamming` is somewhat slower since it uses lazy values (we don't analyze mutable values). `nucleic` is faster since the global control flow analysis is used to avoid some slow function calls. This measurement was made before Apply functions: optimizations #1358 was merged. The gap is probably narrower now.

The analysis is less effective on large programs. Higher-order functions such as `List.iter` are turned into CPS, and then all functions that call, directly or indirectly, such a function need to be turned into CPS as well. There is also some horizontal contamination, where a function needs to be turned into CPS since it is used in a context which expects a CPS function, and this then impacts all other places where it is called. Still, `ocamlc` is now only about ~10% slower (it is about 60% slower with the released version of Js_of_ocaml). CAMLboy is less than 25% slower (650 FPS instead of 800 FPS).
The size of the generated code is less than 20% larger, a few percent larger when compressed. For a large Web app, I see a 44% increase in generated code size (6% when compressed).


Compiling `ocamlc` is about 25% slower.